Development of the method for filtering verbal noise while search keywords for the English text
DOI:
https://doi.org/10.15587/2312-8372.2018.149962
Keywords:
verbal noise filtering, English text keywords, linguistic package, DKPro Core, syntactic analysis
Abstract
The object of research is the processing of verbal information to identify keywords in a text. The most important step in the search for key terms is calculating their weights in the document under consideration, which makes it possible to evaluate their significance relative to one another in that context. The many approaches to this problem are conventionally divided into two groups: those that require training and those that do not. Training here means pre-processing a source corpus of texts to extract information about how frequently terms occur across the whole corpus. An alternative approach uses linguistic ontologies, which are more or less approximate models of the existing set of words in a given language. Systems for the automatic extraction of key terms are built on both approaches. Nevertheless, research on keyword search continues, both to improve the accuracy and completeness of the results and to apply methods of extracting information from text to new problems.
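As an illustration of corpus-based term weighting, the sketch below computes TF-IDF weights. TF-IDF is only one common scheme of this kind; the abstract does not state which weighting formula the authors have in mind, and the function name `tf_idf` and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    `corpus` is a list of documents, each given as a list of tokens.
    """
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            # Relative frequency in the document times the log-scaled rarity in the corpus.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus (hypothetical data): "text" occurs in every document.
corpus = [
    ["keyword", "search", "text"],
    ["text", "noise", "filter"],
    ["text", "syntax", "parser"],
]
weights = tf_idf(corpus)
```

A term that occurs in every document, such as "text" here, receives weight zero, while a term confined to one document is weighted up. This is why the scheme needs the whole corpus to be processed in advance.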
Existing approaches to keyword identification are characterized. The best text processing quality is achieved by linguistic methods or by their combination with statistical methods. A system for automatically determining key phrases in natural language text should therefore be developed using a morphological dictionary and syntax rules.
The study uses an approach to keyword identification based on finding syntactic links between word forms in the sentences of an English text, using the tools provided by modern linguistic packages. Within the general approach to reducing verbal noise, the method achieves the reduction through four formalized operations: replacing pronouns with the corresponding nouns; removing noise connections; removing noise words; and removing stop words. These operations can be used as additional modules that improve keyword search results both for the developed method for determining the keywords of English text and for other keyword search algorithms.
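The four operations named in the abstract can be sketched as a filter over dependency edges. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Edge` type, the function `filter_verbal_noise`, the three sets of noise relations, noise part-of-speech tags, and stop words, and the hand-written parse are all hypothetical. The antecedent map is assumed to come from a separate coreference step.

```python
from typing import NamedTuple

class Edge(NamedTuple):
    head: str
    relation: str
    dependent: str

# Illustrative assumptions only; the actual lists used by the method are not given here.
NOISE_RELATIONS = {"det", "expl", "punct", "fixed", "root"}
NOISE_POS_TAGS = {"DT", "IN", "CC", "TO"}
STOP_WORDS = {"is", "very"}

def filter_verbal_noise(edges, pos_tags, antecedents):
    """Apply the four operations in order and return the surviving edges.

    edges:       list of Edge, as produced by some dependency parser
    pos_tags:    {word: part-of-speech tag}
    antecedents: {pronoun: noun}, assumed to be resolved beforehand
    """
    # 1. Replace pronouns with the nouns they refer to.
    edges = [
        Edge(antecedents.get(e.head, e.head), e.relation,
             antecedents.get(e.dependent, e.dependent))
        for e in edges
    ]
    # 2. Remove noise connections (by relation type).
    edges = [e for e in edges if e.relation not in NOISE_RELATIONS]

    # 3. Remove edges that touch a noise word (judged by part of speech).
    def is_noise(word):
        return pos_tags.get(word) in NOISE_POS_TAGS
    edges = [e for e in edges if not (is_noise(e.head) or is_noise(e.dependent))]

    # 4. Remove edges that touch a stop word.
    def is_stop(word):
        return word.lower() in STOP_WORDS
    return [e for e in edges if not (is_stop(e.head) or is_stop(e.dependent))]

# Hand-written toy parse of "The parser reads text in English. It is very fast."
edges = [
    Edge("ROOT", "root", "reads"), Edge("parser", "det", "The"),
    Edge("reads", "nsubj", "parser"), Edge("reads", "obj", "text"),
    Edge("reads", "nmod", "English"), Edge("English", "case", "in"),
    Edge("fast", "nsubj", "It"), Edge("fast", "cop", "is"),
    Edge("fast", "advmod", "very"), Edge("reads", "punct", "."),
]
pos_tags = {"The": "DT", "parser": "NN", "reads": "VBZ", "text": "NN", "in": "IN",
            "English": "NNP", "It": "PRP", "is": "VBZ", "very": "RB", "fast": "JJ"}
result = filter_verbal_noise(edges, pos_tags, {"It": "parser"})
```

Because each step takes a list of edges and returns a list of edges, any one of them can be dropped or reused in front of a different keyword search algorithm, which matches the abstract's description of the operations as additional modules.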
License
Copyright (c) 2018 Oleg Bisikalo, Alexander Yahimovich, Yaroslav Yahimovich
This work is licensed under a Creative Commons Attribution 4.0 International License.
The assignment of copyright and the conditions for its transfer (identification of authorship) are set out in the License Agreement. In particular, the authors retain authorship rights to their manuscript and grant the journal the right of first publication of this work under the terms of the Creative Commons CC BY license. They may also independently enter into additional agreements for the non-exclusive distribution of the work in the form in which it was published by this journal, provided that a link to its first publication in this journal is preserved.